Prediction is difficult, especially about the future.

— Niels Bohr (apocryphally)

Goals:

Time series format and plotting

A time series is a special data set where each observation has an associated time measurement. There is a special R structure for storing and operating on time series, called ts, as illustrated here:

births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
birthstimeseries <- ts(births, frequency=12, start=c(1946,1))
birthstimeseries
##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 1946 26.663 23.598 26.931 24.740 25.806 24.364 24.477 23.901 23.175 23.227
## 1947 21.439 21.089 23.709 21.669 21.752 20.761 23.479 23.824 23.105 23.110
## 1948 21.937 20.035 23.590 21.672 22.222 22.123 23.950 23.504 22.238 23.142
## 1949 21.548 20.000 22.424 20.615 21.761 22.874 24.104 23.748 23.262 22.907
## 1950 22.604 20.894 24.677 23.673 25.320 23.583 24.671 24.454 24.122 24.252
## 1951 23.287 23.049 25.076 24.037 24.430 24.667 26.451 25.618 25.014 25.110
## 1952 23.798 22.270 24.775 22.646 23.988 24.737 26.276 25.816 25.210 25.199
## 1953 24.364 22.644 25.565 24.062 25.431 24.635 27.009 26.606 26.268 26.462
## 1954 24.657 23.304 26.982 26.199 27.210 26.122 26.706 26.878 26.152 26.379
## 1955 24.990 24.239 26.721 23.475 24.767 26.219 28.361 28.599 27.914 27.784
## 1956 26.217 24.218 27.914 26.975 28.527 27.139 28.982 28.169 28.056 29.136
## 1957 26.589 24.848 27.543 26.896 28.878 27.390 28.065 28.141 29.048 28.484
## 1958 27.132 24.924 28.963 26.589 27.931 28.009 29.229 28.759 28.405 27.945
## 1959 26.076 25.286 27.660 25.951 26.398 25.565 28.865 30.000 29.261 29.012
##         Nov    Dec
## 1946 21.672 21.870
## 1947 21.759 22.073
## 1948 21.059 21.573
## 1949 21.519 22.025
## 1950 22.084 22.991
## 1951 22.964 23.981
## 1952 23.162 24.707
## 1953 25.246 25.180
## 1954 24.712 25.688
## 1955 25.693 26.881
## 1956 26.291 26.987
## 1957 26.634 27.735
## 1958 25.912 26.619
## 1959 26.992 27.897

This reads in a data set of number of births per month in New York City from 1946 to 1958 (it’s not clear what the units are - mabe thousands of births?) To create the time series, we had to give the function the frequency, or the number of time points in a year, and the starting value as a vector assigned to start=c(1946, 1), the first element is the year and the second the month.

Here are two different time series, of diabetic drug sales in Australia (in millions of AUS dollaors), also with monthly frequency, and of Boston marathon winning times, with yearly frequency:

#a10

#marathon

Visualizing the data

The most straighforward way of visualizing time series is using a time plot, which can be created using autoplot:

autoplot(birthstimeseries) +
  ggtitle("Number of births in NYC") +
  ylab("Births (thousands") +
  xlab("Year")

autoplot(a10) +
  ggtitle("Antidiabetic drug sales") +
  ylab("AUS$ (million)") +
  xlab("Year")

autoplot(marathon) +
  ggtitle("Boston marathon winning times") +
  ylab("Time (minutes)") +
  xlab("Year")

Correlations of time series: cross-, auto-, and lag plot

Visualizing correlation between different variables

The following data set contains the number of visitors (visitor nights) on a quarterly basis for five regions of New South Wales, Australia:

autoplot(visnights[,1:5], facets=TRUE) +
  ylab("Number of visitor nights each quarter (millions)")

One simple question is whether different variables are related to each other. One simple way is to calculate the Pearson correlation between different time series, called the cross-correlation (where \(\bar X\) stands for the mean of X and \(Var(X)\) stands for the variance of \(X\)):

\[ Cor(X,Y) = \frac{\sum_t (\bar X - X_t)(\bar Y - Y_t)}{\sqrt{Var(X)Var(Y)}} \]

In a data set with multiple variables it can be handy to examine the correlations between all pairs between them. Here’s a convenient function for that:

head(visnights)
##         NSWMetro NSWNthCo NSWSthCo NSWSthIn NSWNthIn  QLDMetro QLDCntrl
## 1998 Q1 9.047095 8.565678 5.818029 2.679538 2.977507 12.106052 2.748374
## 1998 Q2 6.962126 7.124468 2.466437 3.010732 3.477703  7.786687 4.040915
## 1998 Q3 6.871963 4.716893 1.928053 3.328869 3.014770 11.380024 5.343964
## 1998 Q4 7.147293 6.269299 2.797556 2.417772 3.757972  9.311460 4.260419
## 1999 Q1 7.956923 9.493901 4.853681 3.224285 3.790760 12.671942 4.186113
## 1999 Q2 6.542243 5.401201 2.759843 2.428489 3.395284  9.582965 4.237806
##         QLDNthCo SAUMetro SAUCoast  SAUInner VICMetro  VICWstCo VICEstCo
## 1998 Q1 2.137234 2.881372 2.591997 0.8948773 7.490382 2.4420048 3.381972
## 1998 Q2 2.269596 2.124736 1.375780 0.9792509 5.198178 0.9605047 1.827940
## 1998 Q3 4.890227 2.284870 1.079542 0.9803289 5.244217 0.7559744 1.351952
## 1998 Q4 2.621548 1.785889 1.497664 1.5094343 6.274246 1.2716040 1.493415
## 1999 Q1 2.483203 2.293873 2.247684 0.9635227 9.187422 2.3850583 2.896929
## 1999 Q2 3.377830 2.197418 1.672802 0.9968803 4.992303 1.3288638 1.547901
##         VICInner WAUMetro WAUCoast  WAUInner OTHMetro OTHNoMet
## 1998 Q1 5.326655 3.075779 3.066555 0.6949954 3.437924 2.073469
## 1998 Q2 4.441119 2.154929 3.334405 0.5576796 2.677081 1.787939
## 1998 Q3 3.815645 2.787286 4.365844 1.0061844 3.793743 2.345021
## 1998 Q4 3.859567 2.752910 4.521996 1.1725514 3.304231 1.943689
## 1999 Q1 4.588755 3.519564 3.579347 0.3981829 3.510819 2.165838
## 1999 Q2 4.070401 3.160430 3.408533 0.5960182 2.871867 1.803940
GGally::ggpairs(as.data.frame(visnights[,1:5]))

Autocorrelation

A time series can be correlated against itself shifted in time by some set amount, also called lagged. We can plot the lagged correlations for different visitor

visnights[,1]
##          Qtr1     Qtr2     Qtr3     Qtr4
## 1998 9.047095 6.962126 6.871963 7.147293
## 1999 7.956923 6.542243 6.330364 7.509212
## 2000 7.662491 6.341802 7.827301 9.579562
## 2001 8.270488 7.240427 6.640490 7.111875
## 2002 6.827826 6.404992 6.615760 7.226376
## 2003 7.589058 6.334527 5.996748 6.612846
## 2004 7.758267 6.778836 5.854452 6.200214
## 2005 7.163830 5.082204 5.673551 6.089906
## 2006 8.525916 6.569684 5.771059 6.692897
## 2007 8.158658 5.710082 5.402543 6.803494
## 2008 7.866269 5.616704 5.886764 5.506298
## 2009 6.787225 5.361317 4.699350 6.208784
## 2010 7.148262 4.850217 6.029490 6.238903
## 2011 7.597468 5.815930 6.183239 5.929030
## 2012 7.986931 5.307871 6.054112 6.023897
## 2013 7.028480 5.813450 6.322841 7.101691
## 2014 7.897316 5.997468 6.033533 7.103398
## 2015 8.725132 6.995875 6.294490 6.945476
## 2016 7.373757 6.792234 6.530568 7.878277
visnights_smaller <- window(visnights[,1], start=2000, end = 2010)
gglagplot(visnights_smaller) 

Here the colors indicate the quarter of the variable on the vertical axis, compared with the shifted (lagged variable on the horizontal axis, and the lines connect points in chronological order. The relationship is strongly positive at lags 4 and 8, reflecting the strong seasonality in the data.

This suggests that there is a strong similarity between the time series and itself, shifted by certain time values. This is described by the autocorrelation, which is defined as a function of the lag \(k\):

$$ r(k) =

$$ This can be calculated and plotted for our example of the visitation nights in New South Wales:

ggAcf(visnights_smaller)

Notice the periodicity in the autocorrelation, which indicated periodicity in the time series. Let’s similarly calculate the autocorrelation of the drug sales data:

ggAcf(a10)

Notice how different this correlogram is - there are no zero values of autcorrelation, only slow decay with some small periodic components.

Autocorrelation measures the memory of a signal - for example, pure white noise is uncorrelated with itself even a moment later, and thus has no memory. As such, it is very useful as a measure of a trend in the data - if the time series has slowly decaying, positive autocorrelation, that indicates a pronounced trend, while periodicity indicates seasonality in the data.

Exercise: Use the lag and autocorrelation analysis to describe the patterns in the time series of births in NYC and in the Boston marathon winning times.

Decomposition of time series

There are two main types of decompositions of time series: additive and multiplicative. Let us call \(X_t\) the time series data, \(T_t\) the trend (non-periodic component), \(S_t\) the seasonal part (periodic component), and \(R_t\) the remainder.

\[ X_t = T_t + S_t + R_t \] \[ X_t = T_t \times S_t \times R_t \]

One simple way of removing seasonality and estimating the trend is using the moving average, that is using \(k\) points before and \(k\) points after each point to calculate the trend: \[ T_t = \frac{1}{m} \sum_{i=-k}^k X_{t+i} \]

Here \(m\) is called the order of the moving average and is defined as \(m = 2k+1\). There is a useful function ma() that calculates these averages and allows them to be plotted.

m <- 3
s_name <- paste("MA-", m)
autoplot(a10, series = "data") +
  autolayer(ma(a10, m), series = "MA") +
  xlab("Year") + ylab("AUS $ (millions)") +
  ggtitle("Anti-diabetic drug sales") 
## Warning: Removed 2 rows containing missing values (geom_path).

Exercise: Change the moving average window and see if you can make seasonality vanish!

An even order of periodicity requires an asymmetric averaging window, so to create an symmetric average, one can repeat the moving average of order two on the already-averaged data.

Classic decomposition:

Additive decomposition [1]:

  1. If m is an even number, compute the trend-cycle component \(T_t\) using a 2×m-MA. If m is an odd number, compute the trend-cycle component \(\hat T_t\) using an m-MA.
  2. Calculate the detrended series: \(X_t - \hat T_t\)
  3. To estimate the seasonal component for each season, average the detrended values for that season. For example, with monthly data, the seasonal component for March is the average of all the detrended March values in the data. These seasonal component values are then adjusted to ensure that they add to zero. The seasonal component is obtained by stringing together these monthly values, and then replicating the sequence for each year of data. This gives \(\hat S_t\).
  4. The remainder component is calculated by subtracting the estimated seasonal and trend-cycle components: $ R_t = X_t - T_t - S_t$
a10 %>% decompose(type="additive") %>%
  autoplot() + xlab("Year") +
  ggtitle("Classical additive decomposition
    of the antidiabetic drug sales data")

This simple classical decomposition has numerous flaws, so better, more modern methods are preferred. In particular, it assumes a constant seasonal term, it tends to over-estimate the variation in the trend, it misses data for the first few and last few data points, and can be sensitive to outliers.

STL decomposition

A more robust method is called the STL decomposition (Seasonal and Trend decomposition using Loess). To summarize its advantanges [1]:

  • STL can handle any type of seasonality, not only monthly and quarterly data.
  • The seasonal component is allowed to change over time, and the rate of change can be controlled by the user.
  • The smoothness of the trend-cycle can also be controlled by the user.
  • It can be robust to outliers (i.e., the user can specify a robust decomposition), so that occasional unusual observations will not affect the estimates of the trend-cycle and seasonal components. They will, however, affect the remainder component.
a10 %>%  stl(t.window=13, s.window="periodic", robust=TRUE) %>%
  autoplot()

Exercise: Apply the two decomposition methods to the Boston marathon and to births in NYC time series.

Regression methods

Let us analyze the data set of US quarterly economic data, specifically, the percent change in consumption, income, production, savigs, and unemployment.

head(uschange)
##         Consumption     Income Production   Savings Unemployment
## 1970 Q1   0.6159862  0.9722610 -2.4527003 4.8103115          0.9
## 1970 Q2   0.4603757  1.1690847 -0.5515251 7.2879923          0.5
## 1970 Q3   0.8767914  1.5532705 -0.3587079 7.2890131          0.5
## 1970 Q4  -0.2742451 -0.2552724 -2.1854549 0.9852296          0.7
## 1971 Q1   1.8973708  1.9871536  1.9097341 3.6577706         -0.1
## 1971 Q2   0.9119929  1.4473342  0.9015358 6.0513418         -0.1
autoplot(uschange[,c("Consumption","Income")]) +
  ylab("% change") + xlab("Year")

uschange %>%
  as.data.frame() %>%
  ggplot(aes(x=Income, y=Consumption)) +
    ylab("Consumption (quarterly % change)") +
    xlab("Income (quarterly % change)") +
    geom_point() +
    geom_smooth(method="lm", se=FALSE)

uschange %>%
  as.data.frame() %>%
  GGally::ggpairs()

Let us use four variables as predictors of consumption to calculate a multiple linear regression model using the function tslm():

fit.consMR <- tslm(
  Consumption ~ Income + Production + Unemployment + Savings,
  data=uschange)
summary(fit.consMR)
## 
## Call:
## tslm(formula = Consumption ~ Income + Production + Unemployment + 
##     Savings, data = uschange)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.88296 -0.17638 -0.03679  0.15251  1.20553 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.26729    0.03721   7.184 1.68e-11 ***
## Income        0.71449    0.04219  16.934  < 2e-16 ***
## Production    0.04589    0.02588   1.773   0.0778 .  
## Unemployment -0.20477    0.10550  -1.941   0.0538 .  
## Savings      -0.04527    0.00278 -16.287  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3286 on 182 degrees of freedom
## Multiple R-squared:  0.754,  Adjusted R-squared:  0.7486 
## F-statistic: 139.5 on 4 and 182 DF,  p-value: < 2.2e-16

We can produce a plot of the predicted values together with the observed data on consumption:

autoplot(uschange[,'Consumption'], series="Data") +
  autolayer(fitted(fit.consMR), series="Fitted") +
  xlab("Year") + ylab("") +
  ggtitle("Percent change in US consumption expenditure") +
  guides(colour=guide_legend(title=" "))

It is useful to check the residuals of the regression model:

checkresiduals(fit.consMR)

## 
##  Breusch-Godfrey test for serial correlation of order up to 8
## 
## data:  Residuals from Linear regression model
## LM test = 14.874, df = 8, p-value = 0.06163

The perennial warning: beware of spurious correlations!

These two data sets, on Australian air passengers and rice production in Guinea, have a very strong positive correlation:

aussies <- window(ausair, end=2011)
fit <- tslm(aussies ~ guinearice)
summary(fit)
## 
## Call:
## tslm(formula = aussies ~ guinearice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9448 -1.8917 -0.3272  1.8620 10.4210 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -7.493      1.203  -6.229 2.25e-07 ***
## guinearice    40.288      1.337  30.135  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.239 on 40 degrees of freedom
## Multiple R-squared:  0.9578, Adjusted R-squared:  0.9568 
## F-statistic: 908.1 on 1 and 40 DF,  p-value: < 2.2e-16

However, notice that the residuals indicate a strong trend, which violates the assumptions of linear regression.

checkresiduals(fit)

## 
##  Breusch-Godfrey test for serial correlation of order up to 8
## 
## data:  Residuals from Linear regression model
## LM test = 28.813, df = 8, p-value = 0.000342

There are a number of fun examples of spurious time series correlations in reference [5].

Forecasting using linear regression

One can distringuish between true forecasting, termed ex-ante (from before) prediction, in which we truly try to predict the unknown future, and ex-post forecasts, in which the true values both of the predictors and the response variable are known. The latter is still useful for validating models and for comparing different methods.

The library forecast contains tools to make calculating predicted values in time series simple. One can use the model to forecast the values in the future, based on different scenarios. For example, we may want to investigate the prediction for an economic upturn and a downturn:

fit.consBest <- tslm(
  Consumption ~ Income + Savings + Unemployment,
  data = uschange)

h <- 4
newdata <- data.frame(
    Income = c(1, 1, 1, 1),
    Savings = c(0.5, 0.5, 0.5, 0.5),
    Unemployment = c(0, 0, 0, 0))
fcast.up <- forecast::forecast(fit.consBest, newdata = newdata)
newdata <- data.frame(
    Income = rep(-1, h),
    Savings = rep(-0.5, h),
    Unemployment = rep(0, h))
fcast.down <- forecast::forecast(fit.consBest, newdata = newdata)
# This script does not work for some reason!
autoplot(uschange[, 1]) +
  ylab("% change in US consumption") +
  autolayer(fcast.up, PI = TRUE, series = "increase")  +
  autolayer(fcast.down, PI = TRUE, series = "decrease") +
  guides(colour = guide_legend(title = "Scenario"))

References and further reading:

  1. Rob J Hyndman and George Athanasopoulos. Forecasting: Principles and Practice
  2. Jonathan Cryer and Kung-Sik Chan Time Series Analysis with Applications in R
  3. Cross-validation in forecasting
  4. Time series nested cross-validation
  5. Spurious correlations